1.
PLoS Genet ; 17(8): e1009689, 2021 08.
Article in English | MEDLINE | ID: mdl-34383745

ABSTRACT

Elucidating the transcriptional regulatory networks that underlie growth and development requires robust ways to define the complete set of transcription factor (TF) binding sites. Although TF-binding sites are known to be generally located within accessible chromatin regions (ACRs), pinpointing these DNA regulatory elements globally remains challenging. Current approaches primarily identify binding sites for a single TF (e.g. ChIP-seq), or globally detect ACRs but lack the resolution to consistently define TF-binding sites (e.g. DNase-seq, ATAC-seq). To address this challenge, we developed MNase-defined cistrome-Occupancy Analysis (MOA-seq), a high-resolution (< 30 bp), high-throughput, and genome-wide strategy to globally identify putative TF-binding sites within ACRs. We used MOA-seq on developing maize ears as a proof of concept and were able to define a cistrome of 145,000 MOA footprints (MFs). While a substantial majority (76%) of the known ATAC-seq ACRs intersected with the MFs, only a minority of MFs overlapped with the ATAC peaks, indicating that the majority of MFs were novel and not detected by ATAC-seq. MFs were associated with promoters and significantly enriched for TF-binding and long-range chromatin interaction sites, including for the well-characterized FASCIATED EAR4, KNOTTED1, and TEOSINTE BRANCHED1. Importantly, the MOA-seq strategy improved the spatial resolution of TF-binding prediction and allowed us to identify 215 motif families collectively distributed over more than 100,000 non-overlapping, putatively occupied binding sites across the genome. Our study presents a simple, efficient, and high-resolution approach to identify putative TF footprints and binding motifs genome-wide, to ultimately define a native cistrome atlas.


Subjects
DNA Footprinting/methods , Promoter Regions, Genetic , Transcription Factors/metabolism , Zea mays/genetics , Binding Sites , Chromatin Immunoprecipitation Sequencing , High-Throughput Nucleotide Sequencing , Plant Proteins/genetics , Plant Proteins/metabolism , Regulatory Elements, Transcriptional , Whole Genome Sequencing
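
The abstract above reports an asymmetric overlap: most ATAC-seq ACRs intersect a footprint, yet only a minority of footprints fall inside an ATAC peak. The following is a minimal sketch of how such a pair of fractions can be computed. The function name, the half-open interval convention, and all coordinates are assumptions made for illustration; none of them come from the study.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def fraction_overlapping(query: List[Interval], target: List[Interval]) -> float:
    """Fraction of query intervals sharing at least one base with any target interval."""
    if not query:
        return 0.0
    hits = 0
    for q_start, q_end in query:
        # Two half-open intervals overlap when each starts before the other ends.
        if any(t_start < q_end and q_start < t_end for t_start, t_end in target):
            hits += 1
    return hits / len(query)


# Hypothetical toy coordinates: a few broad ACRs and many narrow footprints.
acrs = [(100, 400), (1000, 1300), (5000, 5300), (9000, 9300)]
footprints = [
    (120, 150), (200, 225), (1100, 1130), (2000, 2030), (3000, 3025),
    (4000, 4030), (5100, 5125), (6000, 6030), (7000, 7030), (8000, 8025),
]

acr_hit_rate = fraction_overlapping(acrs, footprints)        # ACRs touched by a footprint
footprint_hit_rate = fraction_overlapping(footprints, acrs)  # footprints inside an ACR
print(acr_hit_rate, footprint_hit_rate)
```

Because the two fractions use different denominators, a high value in one direction does not imply a high value in the other, which is the pattern the abstract describes.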
2.
BMC Genomics ; 21(1): 773, 2020 Nov 10.
Article in English | MEDLINE | ID: mdl-33167858

ABSTRACT

BACKGROUND: Information on protein-protein interactions affected by mutations is very useful for understanding the biological effect of mutations and for developing treatments targeting the interactions. In this study, we developed a natural language processing (NLP) based machine learning approach for extracting such information from the literature. Our aim is to identify journal abstracts or paragraphs in full-text articles that contain at least one occurrence of a protein-protein interaction (PPI) affected by a mutation. RESULTS: Our system makes use of the latest NLP methods with a large number of engineered features, including some based on pre-trained word embeddings. Our final model achieved satisfactory performance in the Document Triage Task of the BioCreative VI Precision Medicine Track, with the highest recall and a comparable F1-score. CONCLUSIONS: The performance of our method indicates that it is ideally suited for being combined with manual annotations. Our machine learning framework and engineered features will also be very helpful for other researchers to further improve this and other related biological text mining tasks using either traditional machine learning or deep learning based methods.


Subjects
Data Mining , Natural Language Processing , Protein Interaction Mapping , Machine Learning , Mutation
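
The abstract above evaluates document triage by recall and F1-score. As a reminder of how those quantities relate, here is a minimal sketch using the standard definitions. The function name and the toy labels are assumptions for illustration and are not results from the study.

```python
def triage_metrics(gold, predicted):
    """Precision, recall and F1 for binary relevance labels (1 = relevant document)."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # relevant and retrieved
    fp = sum(1 for g, p in pairs if not g and p)    # retrieved but not relevant
    fn = sum(1 for g, p in pairs if g and not p)    # relevant but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical labels: the classifier retrieves every relevant document
# at the cost of two false positives.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 1, 1, 1, 0, 1, 0]
precision, recall, f1 = triage_metrics(gold, predicted)
print(precision, recall, f1)
```

A high-recall triage step of this kind misses few relevant documents and leaves the false positives for curators to discard, which fits the abstract's remark about combining the method with manual annotation.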
3.
PLoS Comput Biol ; 16(11): e1007450, 2020 11.
Article in English | MEDLINE | ID: mdl-33156882

ABSTRACT

Reusability is part of the FAIR data principles, which aim to make data Findable, Accessible, Interoperable, and Reusable. One of the current efforts to increase the reusability of public genomics data has been to focus on the inclusion of quality metadata associated with the data. When necessary metadata are missing, most researchers will consider the data useless. In this study, we developed a framework to predict the missing metadata of gene expression datasets to maximize their reusability. We found that when using predicted data to conduct other analyses, it is not optimal to use all the predicted data. Instead, one should use only the subset of data that can be predicted accurately. We proposed a new metric called Proportion of Cases Accurately Predicted (PCAP), which is optimized in our specifically designed machine learning pipeline. The new approach performed better than pipelines using commonly used metrics such as F1-score in terms of maximizing the reusability of data with missing values. We also found that different variables might need to be predicted using different machine learning methods and/or different data processing protocols. Using differential gene expression analysis as an example, we showed that when missing variables are accurately predicted, the corresponding gene expression data can be reliably used in downstream analyses.


Subjects
Data Curation , Gene Expression , Metadata , Computational Biology
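
The abstract above recommends using only the predictions that can be made accurately. One common way to approximate that idea is to keep a prediction only when the model's top class probability clears a threshold. The sketch below shows that filter under stated assumptions: it is not the paper's PCAP metric, whose exact definition the abstract does not give, and the function name, the threshold, and the probabilities are hypothetical.

```python
def select_confident(probabilities, threshold=0.9):
    """Return (case index, predicted label) for cases whose top probability >= threshold."""
    kept = []
    for index, distribution in enumerate(probabilities):
        label, top = max(distribution.items(), key=lambda item: item[1])
        if top >= threshold:
            kept.append((index, label))
    return kept


# Hypothetical class probabilities for a missing "sex" field in four samples.
predicted = [
    {"male": 0.97, "female": 0.03},
    {"male": 0.55, "female": 0.45},
    {"male": 0.08, "female": 0.92},
    {"male": 0.40, "female": 0.60},
]

kept = select_confident(predicted, threshold=0.9)
proportion_kept = len(kept) / len(predicted)
print(kept, proportion_kept)
```

Raising the threshold trades coverage for reliability: fewer samples regain usable metadata, but those that do are less likely to carry a wrong label into downstream analyses.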
4.
Genome Biol ; 21(1): 165, 2020 07 06.
Article in English | MEDLINE | ID: mdl-32631399

ABSTRACT

BACKGROUND: The functional genome of agronomically important plant species remains largely unexplored, yet presents a virtually untapped resource for targeted crop improvement. Functional elements of regulatory DNA revealed through profiles of chromatin accessibility can be harnessed for fine-tuning gene expression to optimal phenotypes in specific environments. RESULTS: Here, we investigate the non-coding regulatory space in the maize (Zea mays) genome during early reproductive development of pollen- and grain-bearing inflorescences. Using an assay for differential sensitivity of chromatin to micrococcal nuclease (MNase) digestion, we profile accessible chromatin and nucleosome occupancy in these largely undifferentiated tissues and classify at least 1.6% of the genome as accessible, with the majority of MNase hypersensitive sites marking proximal promoters, but also 3' ends of maize genes. This approach maps regulatory elements to footprint-level resolution. Integration of complementary transcriptome profiles and transcription factor occupancy data is used to annotate regulatory factors, such as combinatorial transcription factor binding motifs and long non-coding RNAs, that potentially contribute to organogenesis, including tissue-specific regulation between male and female inflorescence structures. Finally, genome-wide association studies for inflorescence architecture traits based solely on functional regions delineated by MNase hypersensitivity reveal new SNP-trait associations in known regulators of inflorescence development as well as new candidates. CONCLUSIONS: These analyses provide a comprehensive look into the cis-regulatory landscape during inflorescence differentiation in a major cereal crop, which ultimately shapes architecture and influences yield potential.


Subjects
Chromatin Assembly and Disassembly , Gene Expression Regulation, Developmental , Gene Expression Regulation, Plant , Inflorescence/growth & development , Zea mays/growth & development , Genome, Plant , Genome-Wide Association Study , Inflorescence/metabolism , Micrococcal Nuclease , Promoter Regions, Genetic , RNA, Long Noncoding/metabolism , Zea mays/metabolism
5.
Database (Oxford) ; 2019, 2019 01 01.
Article in English | MEDLINE | ID: mdl-30624652

ABSTRACT

Information about the interactions between chemical compounds and proteins is indispensable for understanding the regulation of biological processes and the development of therapeutic drugs. Manually extracting such information from biomedical literature is very time- and resource-consuming. In this study, we propose a computational method to automatically extract chemical-protein interactions (CPIs) from a given text. Our method extracts CPI pairs and CPI triplets from sentences, where a CPI pair consists of a chemical compound and a protein name, and a CPI triplet consists of a CPI pair along with an interaction word describing their relationship. We extracted a diverse set of features from sentences that were used to build multiple machine learning models. Our models contain both simple features, which can be directly computed from sentences, and more sophisticated features derived using sentence structure analysis techniques. For example, one set of features was extracted based on the shortest paths between the CPI pairs or among the CPI triplets in the dependency graphs obtained from sentence parsing. We designed a three-stage approach to predict the multiple categories of CPIs. Our method performed the best among systems that use non-deep learning methods and outperformed several deep-learning-based systems in Track 5 of the BioCreative VI challenge. The features we designed in this study are informative and can be applied to other machine learning methods, including deep learning.


Subjects
Computational Biology/methods , Data Mining/methods , Databases, Chemical , Databases, Protein , Machine Learning , Humans , Pharmaceutical Preparations/chemistry , Pharmaceutical Preparations/metabolism , Proteins/chemistry , Proteins/metabolism , Semantics , Software
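
The abstract above defines a CPI pair as a chemical plus a protein, and a CPI triplet as a pair plus an interaction word. Candidate generation along those lines can be sketched as follows. This covers enumeration only, with no feature extraction or classification; the function name and the hand-tagged example sentence are assumptions, not the authors' code.

```python
from itertools import product


def candidate_pairs_and_triplets(chemicals, proteins, interaction_words):
    """Enumerate every (chemical, protein) pair and (chemical, word, protein) triplet."""
    pairs = list(product(chemicals, proteins))
    triplets = [
        (chemical, word, protein)
        for chemical, protein in pairs
        for word in interaction_words
    ]
    return pairs, triplets


# Hand-tagged mentions for the sentence "Aspirin inhibits COX-1 and COX-2."
# In a real pipeline these would come from entity recognition, which is not shown.
pairs, triplets = candidate_pairs_and_triplets(
    chemicals=["aspirin"],
    proteins=["COX-1", "COX-2"],
    interaction_words=["inhibits"],
)
print(pairs)
print(triplets)
```

Each candidate would then be passed to a classifier that decides whether the sentence actually asserts that interaction, and of which category.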
6.
Sci Rep ; 8(1): 16335, 2018 11 05.
Article in English | MEDLINE | ID: mdl-30397274

ABSTRACT

Molecular mechanisms underlying the health disparity of prostate cancer (PCa) have not been fully determined. In this study, we applied a bioinformatic approach to identify and validate dysregulated genes associated with tumor aggressiveness in African American (AA) compared to Caucasian American (CA) men with PCa. We retrieved and analyzed microarray data from 619 PCa patients, 412 AA and 207 CA, and we validated these genes in tumor tissues and cell lines by Real-Time PCR, Western blot, immunocytochemistry (ICC) and immunohistochemistry (IHC) analyses. We identified 362 genes that were differentially expressed in AA men and involved in regulating signaling pathways associated with tumor aggressiveness. In PCa tissues and cells, NKX3.1, APPL2, TPD52, LTC4S, ALDH1A3 and AMD1 transcripts were significantly upregulated (p < 0.05) compared to normal cells. IHC confirmed the overexpression of TPD52 (p = 0.0098) and LTC4S (p < 0.0005) in AA compared to CA men. ICC and Western blot analyses additionally corroborated this observation in PCa cells. These findings suggest that dysregulation of transcripts in PCa may drive the disparity of PCa outcomes and provide new insights into the development of new therapeutic agents against aggressive tumors. More studies are warranted to investigate the clinical significance of these dysregulated genes in promoting the oncogenic pathways in AA men.


Subjects
Black or African American/genetics , Gene Expression Regulation, Neoplastic , Prostatic Neoplasms/ethnology , Prostatic Neoplasms/genetics , Adult , Black or African American/statistics & numerical data , Cell Line, Tumor , Humans , Male , Middle Aged , Oligonucleotide Array Sequence Analysis , Prognosis , Prostatic Neoplasms/diagnosis , Prostatic Neoplasms/pathology , Signal Transduction/genetics , White People/genetics , White People/statistics & numerical data
7.
Data Brief ; 20: 358-363, 2018 Oct.
Article in English | MEDLINE | ID: mdl-30175199

ABSTRACT

Presented here are data from Next-Generation Sequencing of differential micrococcal nuclease digestions of formaldehyde-crosslinked chromatin in selected tissues of maize (Zea mays) inbred line B73. Supplemental materials include a wet-bench protocol for making DNS-seq libraries and the DNS-seq data processing pipeline for producing genome browser tracks. This report also includes the peak-calling pipeline, which uses the iSeg algorithm to segment positive and negative peaks from the DNS-seq difference profiles. The data repository for the sequence data is the NCBI SRA, BioProject Accession PRJNA445708.

8.
BMC Med Inform Decis Mak ; 18(Suppl 2): 42, 2018 07 23.
Article in English | MEDLINE | ID: mdl-30066644

ABSTRACT

BACKGROUND: Relationships between bio-entities (genes, proteins, diseases, etc.) constitute a significant part of our knowledge. Most of this information is documented as unstructured text in different forms, such as books, articles and on-line pages. Automatic extraction of such information and storing it in structured form could help researchers access it more easily and also make it possible to incorporate it in advanced integrative analysis. In this study, we developed a novel approach to extract bio-entity relationship information using Natural Language Processing (NLP) and a graph-theoretic algorithm. METHODS: Our method, called GRGT (Grammatical Relationship Graph for Triplets), not only extracts the pairs of terms that have certain relationships, but also extracts the type of relationship (the word describing the relationship). In addition, the directionality of the relationship can also be extracted. Our method is based on the assumption that a triplet exists for a pair of interactions. A triplet is defined as two terms (entities) and an interaction word describing the relationship of the two terms in a sentence. We first use a sentence parsing tool to obtain the sentence structure represented as a dependency graph where words are nodes and edges are typed dependencies. The shortest paths among the pairs of words in the triplet are then extracted, which form the basis for our information extraction method. A flexible pattern matching scheme was then used to match a triplet graph with an unknown relationship to those triplet graphs with labels (True or False) in the database. RESULTS: We applied the method on three benchmark datasets to extract protein-protein interactions (PPIs), and obtained better precision than the top performing methods in the literature. CONCLUSIONS: We have developed a method to extract protein-protein interactions from biomedical literature. PPIs extracted by our method have higher precision than those extracted by other methods, suggesting that our method can be used to effectively extract PPIs and deposit them into databases. Beyond extracting PPIs, our method could be easily extended to extracting relationship information between other bio-entities.


Subjects
Algorithms , Information Storage and Retrieval/methods , Natural Language Processing , Proteins/metabolism , Databases, Factual
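
The METHODS text above rests on shortest paths between words in a dependency graph. A minimal sketch of that single step follows, using breadth-first search over an undirected view of the graph. The edge list is hand-written for the sentence "ProteinA directly binds ProteinB in yeast" and is an assumption, not parser output. The function name is hypothetical, and the GRGT pattern-matching stage is not reproduced.

```python
from collections import deque


def shortest_path(edges, source, target):
    """Breadth-first shortest path between two words, ignoring edge direction."""
    graph = {}
    for head, dependent, _label in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)
    previous = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbour in graph.get(node, []):
            if neighbour not in previous:
                previous[neighbour] = node
                queue.append(neighbour)
    return None  # the two words are not connected


# (head, dependent, dependency label), written by hand for illustration.
edges = [
    ("binds", "ProteinA", "nsubj"),
    ("binds", "ProteinB", "obj"),
    ("binds", "directly", "advmod"),
    ("binds", "yeast", "obl"),
    ("yeast", "in", "case"),
]

path = shortest_path(edges, "ProteinA", "ProteinB")
print(path)
```

Here the path between the two entities passes through the interaction word "binds", which is the kind of structure a triplet-based method can match against labeled examples.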
9.
BMC Bioinformatics ; 19(1): 131, 2018 04 11.
Article in English | MEDLINE | ID: mdl-29642840

ABSTRACT

BACKGROUND: Identification of functional elements of a genome often requires dividing a sequence of measurements along a genome into segments where adjacent segments have different properties, such as different mean values. Despite dozens of algorithms developed to address this problem in genomics research, methods with improved accuracy and speed are still needed to effectively tackle both existing and emerging genomic and epigenomic segmentation problems. RESULTS: We designed an efficient algorithm, called iSeg, for segmentation of genomic and epigenomic profiles. iSeg first utilizes dynamic programming to identify candidate segments and test for significance. It then uses a novel data structure based on two coupled balanced binary trees to detect overlapping significant segments and update them simultaneously during searching and refinement stages. Refinement and merging of significant segments are performed at the end to generate the final set of segments. By using an objective function based on the p-values of the segments, the algorithm can serve as a general computational framework to be combined with different assumptions on the distributions of the data. As a general segmentation method, it can segment different types of genomic and epigenomic data, such as DNA copy number variation, nucleosome occupancy, nuclease sensitivity, and differential nuclease sensitivity data. Using simple t-tests to compute p-values across multiple datasets of different types, we evaluate iSeg using both simulated and experimental datasets and show that it performs satisfactorily when compared with some other popular methods, which often employ more sophisticated statistical models. Implemented in C++, iSeg is also very computationally efficient and well suited for large numbers of input profiles and data with very long sequences. CONCLUSIONS: We have developed an efficient general-purpose segmentation tool and showed that its results were comparable to or more accurate than those of many of the most popular segment-calling algorithms used in contemporary genomic data analysis. iSeg is capable of analyzing datasets that have both positive and negative values. Tunable parameters allow users to readily adjust the statistical stringency to best match the biological nature of individual datasets, including widely or sparsely mapped genomic datasets or those with non-normal distributions.


Subjects
Algorithms , Databases, Genetic , Epigenomics , Computer Simulation , DNA Copy Number Variations/genetics , Deoxyribonucleases/metabolism , Genome , Humans , Models, Statistical , Neoplasms/genetics , Zea mays/genetics
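
The abstract above describes scoring candidate segments by significance, with simple t-tests as one option. The sketch below illustrates only that scoring idea: it scans every window of bounded length, computes a one-sample t statistic against a mean of zero using prefix sums, and returns the highest-scoring window. It is an exhaustive toy scan, not the iSeg algorithm, whose coupled balanced binary trees, refinement, and merging are not reproduced. All names, parameters, and values are assumptions.

```python
import math


def prefix_sums(values):
    """Running sums of x and x^2 so any segment's mean and variance cost O(1)."""
    total, total_sq = [0.0], [0.0]
    for v in values:
        total.append(total[-1] + v)
        total_sq.append(total_sq[-1] + v * v)
    return total, total_sq


def segment_t(total, total_sq, start, end):
    """One-sample t statistic (segment mean versus 0) for values[start:end]."""
    n = end - start
    mean = (total[end] - total[start]) / n
    var = (total_sq[end] - total_sq[start] - n * mean * mean) / (n - 1)
    if var <= 0.0:  # constant segment: t is undefined, treat as 0 or infinite
        return 0.0 if mean == 0.0 else math.copysign(math.inf, mean)
    return mean / math.sqrt(var / n)


def most_significant_segment(values, min_len=3, max_len=10):
    """Return (start, end, t) for the window with the largest |t|; min_len must be >= 2."""
    total, total_sq = prefix_sums(values)
    best = None
    for start in range(len(values)):
        for length in range(min_len, max_len + 1):
            end = start + length
            if end > len(values):
                break
            t = segment_t(total, total_sq, start, end)
            if best is None or abs(t) > abs(best[2]):
                best = (start, end, t)
    return best


# Hypothetical profile: background near zero with one elevated stretch.
profile = [0.1, -0.2, 0.0, 0.1, 2.0, 2.1, 1.9, 2.2, 2.0, -0.1, 0.2, 0.0]
start, end, t_stat = most_significant_segment(profile)
print(start, end, round(t_stat, 1))
```

Because the statistic is signed, the same scan flags negative stretches as readily as positive ones, which matters for difference profiles that take both positive and negative values.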
10.
Sci Rep ; 7: 43294, 2017 03 03.
Article in English | MEDLINE | ID: mdl-28256629

ABSTRACT

Choosing the optimal chemotherapy regimen is still an unmet medical need for breast cancer patients. In this study, we reanalyzed data from seven independent data sets with a total of 1079 breast cancer patients. The patients were treated with three different types of commonly used neoadjuvant chemotherapies: anthracycline alone, anthracycline plus paclitaxel, and anthracycline plus docetaxel. We developed random forest models with variable selection using both genetic and clinical variables to predict the response of a patient, using pCR (pathological complete response) as the measure of response. The models were then used to reassign an optimal regimen to each patient to maximize the chance of pCR. An independent validation was performed in which each independent study was left out during model building and later used for validation. The expected pCR rates of our method are significantly higher than the rates of the best treatments for all seven independent studies. A validation study on 21 breast cancer cell lines showed that our prediction agrees with their drug-sensitivity profiles. In conclusion, the new strategy, called PRES (Personalized REgimen Selection), may significantly increase response rates for breast cancer patients, especially those with HER2 and ER negative tumors, who will receive one of the widely accepted chemotherapy regimens.


Subjects
Antineoplastic Agents/administration & dosage , Breast Neoplasms/drug therapy , Breast Neoplasms/pathology , Drug Therapy/methods , Precision Medicine/methods , Transcriptome , Anthracyclines/administration & dosage , Cell Line, Tumor , Docetaxel , Female , Humans , Male , Models, Biological , Paclitaxel/administration & dosage , Taxoids/administration & dosage
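
The abstract above describes two mechanics: assigning each patient the regimen with the highest predicted chance of pCR, and validating by leaving one study out at a time. The sketch below illustrates both under stated assumptions. The per-regimen "models" are plain hypothetical functions standing in for trained random forests, and the patient feature, study names, and all numbers are invented for illustration. None of it is the PRES implementation.

```python
def choose_regimen(patient, models):
    """Return the regimen with the highest predicted pCR probability, plus all scores."""
    scores = {name: model(patient) for name, model in models.items()}
    best = max(scores, key=scores.get)
    return best, scores


def leave_one_study_out(records):
    """Yield (held-out study, training records, validation records) for each study."""
    studies = sorted({record["study"] for record in records})
    for held_out in studies:
        train = [r for r in records if r["study"] != held_out]
        validation = [r for r in records if r["study"] == held_out]
        yield held_out, train, validation


# Hypothetical stand-ins for three fitted models, one per regimen.
models = {
    "anthracycline": lambda p: 0.20,
    "anthracycline+paclitaxel": lambda p: 0.10 + 0.5 * p["score"],
    "anthracycline+docetaxel": lambda p: 0.40 - 0.2 * p["score"],
}

best, scores = choose_regimen({"score": 0.8}, models)
print(best, scores)

# Hypothetical records; in practice each carries features, regimen and outcome.
records = [
    {"study": "A", "score": 0.8}, {"study": "A", "score": 0.1},
    {"study": "B", "score": 0.5}, {"study": "C", "score": 0.3},
]
splits = list(leave_one_study_out(records))
print([(held_out, len(train), len(validation)) for held_out, train, validation in splits])
```

Holding out whole studies rather than random patients keeps study-specific effects out of the training data, so the validation resembles applying the models to a new cohort.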